Introduction

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?

To answer this question, I looked at data from the National Longitudinal Survey of Youth, 1979 cohort. I began by cleaning the data, then created tables and bar charts to explore it.

Data Summary

#Load packages
library(kableExtra)
library(plyr)
library(ggplot2)
library(knitr)
library(MASS)
options(scipen=4)

#Retrieve data
nlsy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/final_project/nlsy79/nlsy79_income.csv", header=TRUE)
#Subset variables
nlsy.var <- c("R0000700", "R0001200", "R0214700", "R0214800", "R0217502", "T3977400", "R3401501")

#Keep the selected variables and give them readable names
nlsy.fp <- nlsy[nlsy.var]
colnames(nlsy.fp) <- c("country.birth", "foreign.lang.spoken", "race", "sex", "fam.size", "income", "highest.grade")

str(nlsy.fp)
## 'data.frame':    12686 obs. of  7 variables:
##  $ country.birth      : int  1 2 1 1 1 1 1 1 1 1 ...
##  $ foreign.lang.spoken: int  4 4 -4 -4 -4 -4 -4 4 -4 -4 ...
##  $ race               : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ sex                : int  2 2 2 2 1 1 1 2 1 2 ...
##  $ fam.size           : int  5 5 5 5 4 4 3 3 6 3 ...
##  $ income             : int  -5 19000 35000 -5 -5 105000 -5 40000 75000 -5 ...
##  $ highest.grade      : int  -5 12 10 14 -5 16 12 12 14 9 ...
#Convert variables to factors and recode 
nlsy.fp <- mutate(nlsy.fp, 
          country.birth = as.factor(mapvalues(country.birth, 
                                     c(1, 2, -3), 
                                     c("In the US", "In other country", NA))),
          foreign.lang.spoken = as.factor(mapvalues(foreign.lang.spoken, 
                                    c(1, 2, 3, 4,-4, -3, -2), 
                                    c("Spanish", "French", "German", "Other", "No Foreign Language", "No Foreign Language", "No Foreign Language"))),
          race = as.factor(mapvalues(race, 
                                    c(1, 2, 3), 
                                    c("Hispanic", "Black", "Non-Black, Non-Hispanic"))),
          sex = as.factor(mapvalues(sex,
                                    c(1, 2), 
                                    c("Male", "Female"))), 
          highest.grade = as.factor(mapvalues(highest.grade,
                                    c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 95, -3, -5), 
                                    c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More", NA, NA, NA))))
## The following `from` values were not present in `x`: 2, 95
#Reorder grades variable
nlsy.fp$highest.grade <- factor(nlsy.fp$highest.grade , levels = c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More"))

#Set negative income codes (invalid or missing responses) to NA
nlsy.fp$income[nlsy.fp$income < 0] <- NA

#Set top-coded income values to NA
nlsy.fp$income[nlsy.fp$income == 343830] <- NA

#Remove NAs in variables
nlsy.fp <- subset(nlsy.fp, !is.na(country.birth), select=country.birth:highest.grade) #Country of birth
nlsy.fp <- subset(nlsy.fp, !is.na(highest.grade), select=country.birth:highest.grade) #Highest grade completed
#Negative values of the income variable were removed because they are survey codes for invalid or missing responses, not actual incomes. The top-coded incomes (the top 2 percent) were also removed to keep the range of the data manageable: they were so far from the rest of the data that their presence distorted the graphs and charts.
#Summary of the clean data
summary(nlsy.fp)
##           country.birth           foreign.lang.spoken
##  In other country: 687   French             :  86    
##  In the US       :9711   German             :  71    
##                          No Foreign Language:8221    
##                          Other              : 422    
##                          Spanish            :1598    
##                                                      
##                                                      
##                       race          sex          fam.size     
##  Black                  :2714   Female:5305   Min.   : 1.000  
##  Hispanic               :1719   Male  :5093   1st Qu.: 3.000  
##  Non-Black, Non-Hispanic:5965                 Median : 4.000  
##                                               Mean   : 4.552  
##                                               3rd Qu.: 6.000  
##                                               Max.   :15.000  
##                                                               
##      income                highest.grade 
##  Min.   :     0   12th Grade      :4513  
##  1st Qu.:   100   4th Year College:1241  
##  Median : 28800   2nd Year College: 889  
##  Mean   : 35149   1st Year College: 873  
##  3rd Qu.: 54000   11th Grade      : 495  
##  Max.   :178000   3rd Year College: 461  
##  NA's   :3807     (Other)         :1926
##average income by sex
in.sex<- ddply(nlsy.fp, ~ sex, summarize, 
               income.sex = mean(income, na.rm = TRUE))
kable(in.sex, format = "html")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
sex      income.sex
Female     28731.39
Male       42323.52

From an initial view of average income by sex, Males earned 42323.52 dollars on average and Females earned 28731.39 dollars, a difference of 13592.14 dollars.

#Table of the number of respondents broken down by gender and race
simpl.tbl <- addmargins(with(nlsy.fp, table(as.array(sex), as.array(race))))
kable(simpl.tbl, format =  "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
         Black   Hispanic   Non-Black, Non-Hispanic     Sum
Female    1352        868                      3085    5305
Male      1362        851                      2880    5093
Sum       2714       1719                      5965   10398

I also wanted to analyze whether income differed among races. First, though, I wanted to know how many respondents fell within each category. Of the 10398 respondents in the cleaned data, 5093 were Male and 5305 were Female. Among the Males there were 1362 Black respondents, 851 Hispanic respondents, and 2880 Non-Black, Non-Hispanic respondents. Among the Females there were 1352 Black respondents, 868 Hispanic respondents, and 3085 Non-Black, Non-Hispanic respondents.

##Average income by race
in.race<- ddply(nlsy.fp, ~ race, summarize, 
                income.race = mean(income, na.rm= TRUE))
kable(in.race, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
race                       income.race
Black                         26915.25
Hispanic                      32768.26
Non-Black, Non-Hispanic       41241.44
#Create a bar chart
ggplot(data = in.race, 
       aes(x = race, 
           y = income.race)) + 
      geom_bar(stat = "identity") +
      xlab("") +
      ylab("Average Income") +
      ggtitle("Average Income by Race")

With the breakdown of respondents by gender and race in hand, I calculated a table of average income by race. Non-Black, Non-Hispanic respondents earned the most on average, at 41241.44 dollars. Hispanic respondents earned the second most, at 32768.26 dollars, and Black respondents earned 26915.25 dollars.

#Average income by sex and race table
income.race.sex<- with(nlsy.fp, round(tapply(income, INDEX = list(race, sex), FUN = mean, na.rm= TRUE)))
kable(income.race.sex, format = "html")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
                           Female    Male
Black                       24161   29899
Hispanic                    27422   38708
Non-Black, Non-Hispanic     32030   51770

Breaking the data down by gender shows that Males in each of the three race groups earned more than Females. Non-Black, Non-Hispanic Males earned the most of the 6 groups, with 51770 dollars, compared to 32030 dollars for Non-Black, Non-Hispanic Females. Hispanic Males earned 38708 dollars compared to 27422 dollars for Hispanic Females. Black Males earned 29899 dollars compared to 24161 dollars for Black Females.

#Create a dataframe
income.race.sex.dif <- ddply(nlsy.fp, ~ race, summarize, 
                  in.gap = mean(income[sex == "Male"], na.rm = TRUE) - mean(income[sex == "Female"], na.rm = TRUE))
                
#Plot data frame
income.race.sex.dif.plot <- ggplot(data = income.race.sex.dif, aes(x = race, y = in.gap, fill = race)) +
  geom_bar(stat = "identity") +
  xlab("") + 
  ylab("Difference of Income") +
  ggtitle("Difference of Income between Men and Women") + 
  guides(fill = "none") 

income.race.sex.dif.plot

To get a better sense of how the gender gap differs across races, I plotted a bar chart of the difference in average income (Male minus Female). The higher the bar, the larger the difference. The difference in average income is largest among Non-Black, Non-Hispanic respondents.

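# Create a data frame of group summaries
# Note: size counts all rows in the group, including those with missing income,
# and lower/upper are one standard deviation either side of the mean (not a 95% CI)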
base.df<- ddply(nlsy.fp, ~ race + sex, summarize, 
             mean.income = mean(income, na.rm= TRUE),
             size= length(income),
             standard.deviation= sd(income, na.rm=T),
             standard.error.mean.income= round(standard.deviation/ sqrt(size),2),
             lower = mean.income - standard.deviation, 
             upper= mean.income + standard.deviation)
kable(base.df, format = "html")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))
race                      sex      mean.income   size   standard.deviation   standard.error.mean.income        lower      upper
Black                     Female      24161.04   1352             26882.23                       731.10   -2721.1861   51043.27
Black                     Male        29898.98   1362             33231.11                       900.44   -3332.1266   63130.09
Hispanic                  Female      27422.25    868             29441.79                       999.32   -2019.5415   56864.05
Hispanic                  Male        38708.27    851             35973.71                      1233.16    2734.5639   74681.98
Non-Black, Non-Hispanic   Female      32029.78   3085             32502.83                       585.19    -473.0494   64532.62
Non-Black, Non-Hispanic   Male        51769.92   2880             40176.26                       748.64   11593.6511   91946.18

The standard deviation for every race also shows that there is more variability in income among Males than among Females. For example, the standard deviation of income for Black Males is 33231.11 dollars, compared to 26882.23 dollars for Black Females, so the incomes of Black Males are more spread out than those of Black Females. The standard error of the mean is also smaller for Females than for Males in each of the 3 races; for example, it is 999.32 dollars for Hispanic Females compared to 1233.16 dollars for Hispanic Males. The smaller the standard error, the more precisely the sample mean estimates the true population mean. The standard error on its own does not establish whether the difference between genders is significant; the t-tests in the Findings section address that. Note also that size counts every respondent in a group, including those with missing income, so these standard errors are somewhat understated.
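
As a minimal sketch of the standard error calculation when missing incomes are excluded from the count, the following uses made-up values rather than the NLSY data. The helper name `mean_se_ci` is hypothetical, and the 95% interval uses a normal approximation, which is an assumption and not something the report computes.

```r
# Hypothetical helper: mean, standard error, and a normal-approximation
# 95% interval, using only the non-missing observations.
mean_se_ci <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before counting
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)      # standard error of the mean
  c(mean = m, n = n, se = se,
    lower = m - qnorm(0.975) * se,
    upper = m + qnorm(0.975) * se)
}

# Made-up incomes with two missing values
toy.income <- c(20000, NA, 30000, 40000, NA, 50000)
res <- mean_se_ci(toy.income)
round(res, 2)
```

Here `n` is 4 rather than 6, so the standard error is larger than it would be if the missing rows were counted.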

Methodology

The variables I chose to answer this question were:
- Whether the respondent was born in the United States. I wanted to use this variable to help determine whether being a U.S. citizen had a positive effect on income, using country of birth as a proxy for citizenship. It is easier for U.S. citizens to get jobs than for non-U.S. citizens, who may need sponsorship.

Within the income variable, the top 2% of incomes were “top coded”, which means that we do not see the actual incomes of the top 2% of earners. For these respondents, the income variable is instead the average income of the top 2% of earners. After exploring the bar charts and tables, I removed the top-coded values because they were so high that they distorted the charts. They also behaved like outliers, and it was not clear that they would contribute to the study.
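
The top-coding step can be sketched on a toy vector. The value 343830 is the top code used in the cleaning step of this report; the other incomes are made up for illustration, and the variable names are hypothetical.

```r
# Assumption: every top-coded income shares one known value.
top.code <- 343830
toy.income <- c(28000, 343830, 54000, NA, 343830, 178000)  # made-up data

# Flag top-coded entries; the !is.na() guard keeps missing values as FALSE
is.topcoded <- !is.na(toy.income) & toy.income == top.code

# Share of non-missing incomes that are top coded
share.topcoded <- mean(is.topcoded[!is.na(toy.income)])

# Replace the top-coded entries with NA and check the new maximum
toy.income[is.topcoded] <- NA
max(toy.income, na.rm = TRUE)
```

Checking the share before dropping the values is a quick way to confirm that only a small slice of the sample is affected.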

After careful analysis, I also removed all negative values from my data. I did recode some of the negative values based on the data dictionary; for example, negative codes on the foreign language question became ‘No Foreign Language’. I judged the remaining negative values to be few enough that removing them would not affect the rest of the study.
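
The rule for negative codes can be illustrated with a small sketch. The helper name `recode_negative_to_na` and the toy values are hypothetical; only the idea that negative survey codes mark missing or invalid responses comes from this report.

```r
# Hypothetical helper: treat negative survey codes as missing values.
recode_negative_to_na <- function(x) {
  x[which(x < 0)] <- NA   # which() ignores entries that are already NA
  x
}

# Made-up income vector; -5 and -3 stand in for survey codes
toy.income <- c(-5, 19000, 35000, -3, 0, NA, 40000)
clean.income <- recode_negative_to_na(toy.income)
clean.income
sum(is.na(clean.income))  # number of values now treated as missing
```

A reported income of 0 is kept, since only negative values are codes.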

There were 212 more Female respondents than Male respondents in the study, but out of 10398 total respondents the distribution seemed balanced. Broken down by race, the data seem to reflect the current population, with the most respondents being Non-Black, Non-Hispanic, followed by Black respondents and then Hispanic respondents.

The graph titled “Difference of Income between Men and Women” was interesting to see because I expected there to be a difference, but not one as large as the gap between Non-Black, Non-Hispanic Males and Females.

I was also surprised to see that in each of the 3 races there was a noticeable difference in income between Males and Females. I had expected very little difference, if any, among Black and Hispanic respondents.

The next section will go into the main findings and how these variables impact the income between Males and Females.

Findings

#plot data
base.df.plot <- ggplot(base.df, 
                    aes(x= sex, 
                        y= mean.income,
                        fill= race)
                    ) +
                    geom_col(position= "dodge") +
                    geom_errorbar(aes(ymin = lower, 
                                      ymax = upper), 
                                      width = .2, position= position_dodge(0.9)) +
                    xlab("") +
                    ylab("Average Income") +
                    ggtitle("Average Income Distribution by Gender Among Race")
base.df.plot

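The error bars span one standard deviation on either side of each group's mean income; they show the spread of incomes, not a 95 percent confidence interval for the mean. For Non-Black, Non-Hispanic Males, for example, the bars run from 11593.65 to 91946.18 dollars.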

# qq plot
with(nlsy.fp, qqnorm(income[sex=="Male"], main = "Normal Q-Q Plot for Males"))
# add reference line
with(nlsy.fp, qqline(income[sex=="Male"], col = "red"))

To assess whether the observed data follow a normal distribution, I drew a Q-Q plot. Among Males, the data are not normally distributed. The upward curve at the top suggests a strong positive skew, and the bottom of the plot suggests that there are many low values.

# qq plot
with(nlsy.fp, qqnorm(income[sex=="Female"], main = "Normal Q-Q Plot for Females"))
# add reference line
with(nlsy.fp, qqline(income[sex=="Female"], col = "red"))

The Q-Q plot for Females shows the same pattern: the data are not normally distributed. The top values depart from the reference line even more than for Males, which also suggests a strong positive skew.
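
The skew seen in the Q-Q plots can also be expressed as a number. The sketch below computes sample skewness on simulated right-skewed data, not the NLSY incomes; the helper name `sample_skewness` is hypothetical, and the log transform is shown only as one common way to reduce positive skew.

```r
# Hypothetical helper: third central moment divided by the variance^1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Simulated, strictly positive, right-skewed "incomes"
toy.income <- rlnorm(1000, meanlog = 10, sdlog = 1)

sample_skewness(toy.income)        # clearly positive for right-skewed data
sample_skewness(log(toy.income))   # close to zero after a log transform
```

The real income variable contains zeros, so a plain `log()` would not apply to it directly; that is a limitation of this sketch.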

Sex T.test

income.sex.t.test <- t.test(income ~ sex, data = nlsy.fp)
income.sex.t.test
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = -15.754, df = 5911.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15283.48 -11900.80
## sample estimates:
## mean in group Female   mean in group Male 
##             28731.39             42323.52
options(scipen=4)

To test the difference in average income between the two groups, I ran a t-test. The p-value is below 2.2e-16 (effectively 0) and the t statistic is -15.75, so the null hypothesis of no difference in income by gender is rejected. The t statistic is negative because the test subtracts the Male mean from the Female mean, so this t-test suggests that Males do have higher income. The 95% confidence interval runs from -15283.48 to -11900.80: if this sample is representative of the larger population, we can be 95% confident that the gender difference in mean income lies between those values. The mean income for Females is 28731.39 and the mean income for Males is 42323.52.
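
The quantities quoted from the t-test output can be pulled from the test object directly. This is a sketch on simulated data, not the NLSY sample; the data frame `toy` and its group means are made up for illustration.

```r
set.seed(42)
# Simulated data: two groups with different mean incomes
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 200),
  income = c(rnorm(200, mean = 29000, sd = 8000),
             rnorm(200, mean = 42000, sd = 8000))
)

tt <- t.test(income ~ sex, data = toy)   # Welch two-sample test by default

tt$statistic               # t statistic (Female mean minus Male mean)
tt$p.value                 # very small, but not literally zero
tt$conf.int                # 95% interval for mean(Female) - mean(Male)
unname(diff(tt$estimate))  # Male mean minus Female mean
format.pval(tt$p.value)    # compact display for very small p-values
```

Reading the p-value from `tt$p.value` avoids reporting it as exactly 0 when the printout only shows an upper bound.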

The next three t tests take a look based on race:

Non-Black, Non-Hispanic t.test

income.sex.t.test2 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Non-Black, Non-Hispanic"))
income.sex.t.test2
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = -15.3, df = 2928.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22269.87 -17210.39
## sample estimates:
## mean in group Female   mean in group Male 
##             32029.78             51769.92

Among Non-Black, Non-Hispanic respondents, the p-value for the t-test is below 2.2e-16 (effectively 0) and the t statistic is -15.3, so the null hypothesis of no difference in income by gender is rejected. This t-test suggests that Non-Black, Non-Hispanic Males do have higher income. The 95% confidence interval runs from -22269.87 to -17210.39: if this sample is representative of the larger population, we can be 95% confident that the gender difference in mean income lies between those values. The mean income for Non-Black, Non-Hispanic Females is 32029.78 and the mean income for Non-Black, Non-Hispanic Males is 51769.92.

Hispanics t.test

income.sex.t.test3 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Hispanic"))
income.sex.t.test3
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = -6.0851, df = 1165.5, p-value = 1.577e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14924.923  -7647.113
## sample estimates:
## mean in group Female   mean in group Male 
##             27422.25             38708.27

Among Hispanic respondents, the p-value for the t-test is 1.577e-09 (effectively 0) and the t statistic is -6.09, so the null hypothesis of no difference in income by gender is rejected. This t-test suggests that Hispanic Males do have higher income. The 95% confidence interval runs from -14924.92 to -7647.11: if this sample is representative of the larger population, we can be 95% confident that the gender difference in mean income lies between those values. The mean income for Hispanic Females is 27422.25 and the mean income for Hispanic Males is 38708.27.

Blacks t.test

income.sex.t.test4 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Black"))
income.sex.t.test4
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = -4.2766, df = 1892, p-value = 0.00001992
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8369.317 -3106.560
## sample estimates:
## mean in group Female   mean in group Male 
##             24161.04             29898.98

Among Black respondents, the p-value for the t-test is 0.00001992 and the t statistic is -4.28, so the null hypothesis of no difference in income by gender is rejected. This t-test suggests that Black Males do have higher income. The 95% confidence interval runs from -8369.32 to -3106.56: if this sample is representative of the larger population, we can be 95% confident that the gender difference in mean income lies between those values. The mean income for Black Females is 24161.04 and the mean income for Black Males is 29898.98.
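
The three race-specific tests follow the same pattern, so they can be run in one pass. The sketch below does this on simulated data, not the NLSY sample; the data frame `toy` and the column names of the summary table are hypothetical.

```r
set.seed(7)
# Simulated data; the group labels reuse the names from this report
toy <- expand.grid(id = 1:100,
                   sex = c("Female", "Male"),
                   race = c("Black", "Hispanic", "Non-Black, Non-Hispanic"),
                   stringsAsFactors = FALSE)
toy$income <- rnorm(nrow(toy),
                    mean = ifelse(toy$sex == "Male", 40000, 30000),
                    sd = 10000)

# One Welch t-test per race group, collected into a small summary table
by.race <- lapply(split(toy, toy$race), function(d) {
  tt <- t.test(income ~ sex, data = d)
  data.frame(t = unname(tt$statistic),
             p.value = tt$p.value,
             ci.lower = tt$conf.int[1],
             ci.upper = tt$conf.int[2])
})
race.tests <- do.call(rbind, by.race)
race.tests
```

Collecting the results in one table makes it easier to compare the size of the gap across groups than reading three separate printouts.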

gender.income.boxplot <- ggplot( data = nlsy.fp, 
      aes(x = sex, 
          y = income, 
          fill = race)) +
      geom_boxplot() +
      xlab("") +
      ylab("Income") +
      ggtitle("Income Distribution by Gender Among Race") 

gender.income.boxplot
## Warning: Removed 3807 rows containing non-finite values (stat_boxplot).

This box plot shows how income is distributed within each gender and race group. Each box marks the median and quartiles, while the averages quoted here come from the table of average income by sex and race. The average incomes of Females lie relatively close to each other, from 24161 dollars for Black Females to 32030 dollars for Non-Black, Non-Hispanic Females, who earn the most. The gaps between average incomes are larger among Males. The average income of Black Males is the lowest among Males at 29899 dollars, about the same level as Hispanic Females at 27422 dollars. The average income of Non-Black, Non-Hispanic Males is the largest at 51770 dollars, above the third quartile for Black Females. There are some outliers in each box plot, most of them above 150,000 dollars.

#create a dataframe
sex.country.income<- ddply(nlsy.fp, ~ country.birth + sex, summarize, mean.income = mean(income, na.rm= TRUE))

#subset dataframe 
sex.country.income<- subset(sex.country.income, !is.na(country.birth), select= c(country.birth, sex, mean.income))

#plot dataframe
sex.country.income.plot <- ggplot(data = sex.country.income, 
      aes(x = country.birth, 
          y = mean.income,
          fill = sex)) + 
      geom_col(position= "dodge") +
      xlab("") +
      ylab("Average Income") +
      ggtitle("Average Income by Country of Birth Among Gender")+
      scale_fill_brewer(palette = "Spectral")
sex.country.income.plot

The bar graph shows that Males continue to earn more than Females, even when the data are broken down by country of birth. It is also interesting to note that Females not born in the U.S. earn, on average, 184.67 dollars more than Females born in the U.S. The difference is larger for Males: on average, Males born in another country earn 2686.26 dollars more than Males born in the U.S.

sex.language.income <- ddply(nlsy.fp, ~ foreign.lang.spoken + sex, summarize, mean.income = mean(income, na.rm= TRUE))

sex.language.income.plot <- ggplot(data = sex.language.income, 
      aes(x = foreign.lang.spoken, 
          y = mean.income,
          fill = sex)) + 
      geom_col(position= "dodge") +
      xlab("Foreign Language Spoken at Home") +
      ylab("Average Income") +
      ggtitle("Average Income by Foreign Language Spoken at Home Among Gender") +
      scale_fill_brewer(palette = "Spectral")
sex.language.income.plot

The same trend continues when the gender difference in income is broken down by foreign language spoken at home. The ‘No Foreign Language’ group, which suggests an English-only household, shows the same gap we have been noticing. The biggest difference between genders is in the ‘Other’ language group, with a difference of 21326.28 dollars, followed by the German group with a difference of 20188.53 dollars.

ggplot(nlsy.fp, aes(x= foreign.lang.spoken, 
                    y= income, 
                    color= sex)) + 
                geom_jitter(alpha = .25) +
                ylab( "Income") +
                xlab("Foreign Language Spoken at Home") +
                ggtitle("Income by Foreign Language Spoken at Home") 
## Warning: Removed 3807 rows containing missing values (geom_point).

Using a jitter plot, we can see how many respondents fall in each language group. There are far more points in the ‘No Foreign Language’ group than in any other, and Spanish comes second. The German and Other groups, where the largest gender differences appeared, have far fewer observations than ‘No Foreign Language’ and Spanish, so their averages are less reliable. Within this graphic, Males continue to earn more than Females.

##average income by sex and grade
sex.grade.income <- ddply(nlsy.fp, ~ sex + highest.grade, summarize,
                       mean.income = round(mean(income, na.rm = TRUE),digits = 2))

#Reorder Grades
sex.grade.income$highest.grade <- factor(sex.grade.income$highest.grade , levels = c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More"))

sex.grade.income.plot <- ggplot(data = sex.grade.income, 
      aes(x = highest.grade, 
          y = mean.income,
          fill = sex)) + 
      geom_bar(stat = "identity", position= "dodge") +
      xlab("Highest Grade Completed") +
      ylab("Average Income") +
      ggtitle("Average Income by Highest Grade Completed Among Gender") + 
      theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
      scale_fill_brewer(palette = "Spectral")

sex.grade.income.plot
## Warning: Removed 2 rows containing missing values (geom_bar).

Looking at education, average income is positively related to years of education: it appears that the more years of education one has, the more income one earns. Still, the chart shows that Males, on average, earn more than Females at every year of education, although the gap seems to shrink once Females have 7 or more years of college education. There are two major jumps for both genders: (1) after completing 12 years of education (graduating with a high school diploma) and (2) after completing 4 years of college education (graduating with a Bachelor’s degree). The disparity between the sexes seems to widen after graduating high school. It is interesting to note that there is an outlier at the 3rd grade.

#Scatter plot
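#as.numeric() returns the factor level index, where "None" is 1, so subtract 1 to plot years completed
ggplot(nlsy.fp, aes(x=as.numeric(highest.grade) - 1, 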
                    y=income, 
                    color=sex)) + 
                geom_jitter(alpha = .25) +
                geom_smooth()+
                ylab( "Income") +
                xlab("Highest Grade Completed") +
                ggtitle("Income by Highest Grade Completed") 
## `geom_smooth()` using method = 'gam'
## Warning: Removed 3807 rows containing non-finite values (stat_smooth).
## Warning: Removed 3807 rows containing missing values (geom_point).

This chart shows the same information for individual respondents rather than group averages. The jumps at 12 and 16 years of education (high school diploma and Bachelor’s degree, respectively) are clearly visible. The curvature of the smoothed lines also shows income increasing with education for both Males and Females.

final.regression <- lm(income ~ ., data= nlsy.fp)
options(scipen=15)
kable(summary(final.regression)$coef, digits = c(3, 3, 3, 4))
                                           Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                               -1509.469    22378.191    -0.067     0.9462
country.birthIn the US                    -5039.243     1735.719    -2.903     0.0037
foreign.lang.spokenGerman                 -6422.786     6416.237    -1.001     0.3169
foreign.lang.spokenNo Foreign Language    -2323.848     4389.581    -0.529     0.5965
foreign.lang.spokenOther                  -1844.781     4914.013    -0.375     0.7074
foreign.lang.spokenSpanish                -7535.432     4910.354    -1.535     0.1249
raceHispanic                              11841.049     2477.168     4.780     0.0000
raceNon-Black, Non-Hispanic               10206.605      907.523    11.247     0.0000
sexMale                                   15387.589      766.079    20.086     0.0000
fam.size                                    -24.046      174.238    -0.138     0.8902
highest.grade1st Grade                    -6466.938    37859.896    -0.171     0.8644
highest.grade3rd Grade                    12988.142    23953.721     0.542     0.5877
highest.grade4th Grade                    -1838.142    25857.062    -0.071     0.9433
highest.grade5th Grade                     2517.396    25234.641     0.100     0.9205
highest.grade6th Grade                     4530.280    22495.280     0.201     0.8404
highest.grade7th Grade                     4354.194    22299.914     0.195     0.8452
highest.grade8th Grade                     5117.236    22039.643     0.232     0.8164
highest.grade9th Grade                     8599.148    21961.984     0.392     0.6954
highest.grade10th Grade                   11835.556    21949.801     0.539     0.5898
highest.grade11th Grade                   13190.014    21937.622     0.601     0.5477
highest.grade12th Grade                   24564.203    21879.699     1.123     0.2616
highest.grade1st Year College             33141.560    21909.301     1.513     0.1304
highest.grade2nd Year College             37319.086    21910.684     1.703     0.0886
highest.grade3rd Year College             40066.387    21943.567     1.826     0.0679
highest.grade4th Year College             53515.006    21906.261     2.443     0.0146
highest.grade5th Year College             56257.657    22014.683     2.555     0.0106
highest.grade6th Year College             56316.646    22072.948     2.551     0.0108
highest.grade7th Year College             73701.571    22337.726     3.299     0.0010
highest.grade8th Year College or More     90254.902    22618.425     3.990     0.0001
final.regression.coef <- round(summary(final.regression)$coef, 4)
class(final.regression.coef)
## [1] "matrix"
attributes(final.regression.coef)
## $dim
## [1] 29  4
## 
## $dimnames
## $dimnames[[1]]
##  [1] "(Intercept)"
##  [2] "country.birthIn the US"
##  [3] "foreign.lang.spokenGerman"
##  [4] "foreign.lang.spokenNo Foreign Language"
##  [5] "foreign.lang.spokenOther"
##  [6] "foreign.lang.spokenSpanish"
##  [7] "raceHispanic"
##  [8] "raceNon-Black, Non-Hispanic"
##  [9] "sexMale"
## [10] "fam.size"
## [11] "highest.grade1st Grade"
## [12] "highest.grade3rd Grade"
## [13] "highest.grade4th Grade"
## [14] "highest.grade5th Grade"
## [15] "highest.grade6th Grade"
## [16] "highest.grade7th Grade"
## [17] "highest.grade8th Grade"
## [18] "highest.grade9th Grade"
## [19] "highest.grade10th Grade"
## [20] "highest.grade11th Grade"
## [21] "highest.grade12th Grade"
## [22] "highest.grade1st Year College"
## [23] "highest.grade2nd Year College"
## [24] "highest.grade3rd Year College"
## [25] "highest.grade4th Year College"
## [26] "highest.grade5th Year College"
## [27] "highest.grade6th Year College"
## [28] "highest.grade7th Year College"
## [29] "highest.grade8th Year College or More"
## 
## $dimnames[[2]]
## [1] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"

Interpretation

There are several statistically significant predictors of income, such as being born in the U.S., race, sex, and education. The p-value for being born in the U.S. is 0.0037; the p-values for being Hispanic, for being Non-Black, Non-Hispanic, and for being Male are all below 0.0001 (shown as 0 in the table). For education, the p-value is 0.0146 for 4 years of college, 0.0106 for 5 years, 0.0108 for 6 years, 0.001 for 7 years, and 0.0001 for 8 years of college or more. The p-values for 2 years (0.0886) and 3 years (0.0679) of college fall short of the 0.05 level and are significant only at the 10% level. All of these coefficients, with the exception of country.birthIn the US, are positive, which suggests a positive relationship with income.

The baseline for this regression is Black Females who were not born in the U.S., who were raised speaking French as a child, and who completed no schooling. The interpretations are made relative to this baseline. Education is the largest driver of income, followed by sex, race, and country of birth.
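
The baseline comes from the first level of each factor, which R orders alphabetically by default. The sketch below shows this on a made-up factor that reuses the language labels from this report; choosing ‘No Foreign Language’ as an alternative baseline is only an illustration, not something done in the analysis.

```r
# Made-up factor using the report's language labels
lang <- factor(c("French", "Spanish", "No Foreign Language", "German",
                 "No Foreign Language", "Other"))
levels(lang)[1]    # alphabetical default: the first level is the baseline

# relevel() moves a chosen level to the front, making it the new baseline
lang2 <- relevel(lang, ref = "No Foreign Language")
levels(lang2)[1]
```

With a different reference level, each language coefficient in `lm()` would be read as a difference from that level instead of from French.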

#interaction
regression.interact <- lm(income ~ sex * race, data = nlsy.fp)
kable(summary(regression.interact)$coef, digits = c(3, 3, 3, 4))
                                        Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                            24161.042     1034.641    23.352     0.0000
sexMale                                 5737.938     1493.375     3.842     0.0001
raceHispanic                            3261.210     1665.432     1.958     0.0503
raceNon-Black, Non-Hispanic             7868.743     1313.460     5.991     0.0000
sexMale:raceHispanic                    5548.080     2413.665     2.299     0.0216
sexMale:raceNon-Black, Non-Hispanic    14002.193     1906.082     7.346     0.0000
regression.interact.coef <- round(summary(regression.interact)$coef, 4)
class(regression.interact.coef)
## [1] "matrix"
attributes(regression.interact.coef)
## $dim
## [1] 6 4
## 
## $dimnames
## $dimnames[[1]]
## [1] "(Intercept)"
## [2] "sexMale"
## [3] "raceHispanic"
## [4] "raceNon-Black, Non-Hispanic"
## [5] "sexMale:raceHispanic"
## [6] "sexMale:raceNon-Black, Non-Hispanic"
## 
## $dimnames[[2]]
## [1] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"

More Interpretation

To look at specific variables, I fit a regression with an interaction between sex and race. All coefficients were statistically significant at the 0.05 level except raceHispanic, whose p-value of 0.0503 falls just above it. On average, Black Males earned 5737.94 dollars more than Black Females. On average, Hispanic Males earned 11286.02 dollars more than Hispanic Females, and Non-Black, Non-Hispanic Males earned 19740.13 dollars more than Non-Black, Non-Hispanic Females.
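
The within-race gaps follow from adding the sexMale coefficient to the matching interaction term. The sketch below redoes that arithmetic with the coefficients copied from the interaction regression table; the shortened names in the vector `b` are hypothetical labels.

```r
# Coefficients copied from the interaction regression table
b <- c(sexMale = 5737.938,
       sexMale.Hispanic = 5548.080,
       sexMale.NBNH = 14002.193)   # NBNH = Non-Black, Non-Hispanic

# Male minus Female gap within each race group
gap <- c(Black = b[["sexMale"]],
         Hispanic = b[["sexMale"]] + b[["sexMale.Hispanic"]],
         NBNH = b[["sexMale"]] + b[["sexMale.NBNH"]])
round(gap, 2)
```

These sums agree, up to rounding, with the differences between the Male and Female means in the table of average income by sex and race.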

Discussion

In summary, Males do earn more than Females. The difference varies once it is broken down by race: it is largest for Non-Black, Non-Hispanics, followed by Hispanics, and then Blacks. The disparity still exists when the data are broken down by education. Although more education goes with more income, the gap in income remains. Furthermore, when the data are broken down by country of birth, the gap is still there, although being born in the U.S. appears to be associated with lower income than being born outside the country. Family size was not a statistically significant predictor of income. There may be potential confounders, such as the type of industry one works in or the length of one’s employment. I have confidence in my conclusion, but further analysis would have to be done to investigate those confounders before presenting the findings to policymakers.